{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "view-in-github"
   },
   "source": [
    "<a href=\"https://colab.research.google.com/github/NeuromatchAcademy/course-content/blob/master/tutorials/W1D3_ModelFitting/student/W1D3_Tutorial6.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "\n",
    "# Neuromatch Academy: Week 1, Day 3, Tutorial 6\n",
    "# Model Selection: Cross-validation\n",
    "\n",
    "\n",
    "**Content creators**: Pierre-Étienne Fiquet, Anqi Wu, Alex Hyafil with help from Ella Batty\n",
    "\n",
    "**Content reviewers**: Lina Teichmann, Patrick Mineault, Michael Waskom\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "---\n",
    "#Tutorial Objectives\n",
    "\n",
    "This is Tutorial 6 of a series on fitting models to data. We start with simple linear regression, using least squares optimization (Tutorial 1) and Maximum Likelihood Estimation (Tutorial 2). We will use bootstrapping to build confidence intervals around the inferred linear model parameters (Tutorial 3). We'll finish our exploration of regression models by generalizing to multiple linear regression and polynomial regression (Tutorial 4). We end by learning how to choose between these various models. We discuss the bias-variance trade-off (Tutorial 5) and Cross Validation for model selection (Tutorial 6).\n",
    "\n",
    "Tutorial objectives:\n",
    "* Implement cross-validation and use it to compare polynomial regression model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "---\n",
    "# Setup"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "cellView": "both",
    "colab": {},
    "colab_type": "code"
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "from sklearn.model_selection import KFold"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "cellView": "form",
    "colab": {},
    "colab_type": "code"
   },
   "outputs": [],
   "source": [
    "#@title Figure Settings\n",
    "%config InlineBackend.figure_format = 'retina'\n",
    "plt.style.use(\"https://raw.githubusercontent.com/NeuromatchAcademy/course-content/master/nma.mplstyle\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "cellView": "form",
    "colab": {},
    "colab_type": "code"
   },
   "outputs": [],
   "source": [
    "#@title Helper functions\n",
    "def ordinary_least_squares(x, y):\n",
    "  \"\"\"Ordinary least squares estimator for linear regression.\n",
    "\n",
    "  Args:\n",
    "    x (ndarray): design matrix of shape (n_samples, n_regressors)\n",
    "    y (ndarray): vector of measurements of shape (n_samples)\n",
    "\n",
    "  Returns:\n",
    "    ndarray: estimated parameter values of shape (n_regressors)\n",
    "  \"\"\"\n",
    "\n",
    "  return np.linalg.inv(x.T @ x) @ x.T @ y\n",
    "\n",
    "def make_design_matrix(x, order):\n",
    "  \"\"\"Create the design matrix of inputs for use in polynomial regression\n",
    "\n",
    "  Args:\n",
    "    x (ndarray): input vector of shape (n_samples)\n",
    "    order (scalar): polynomial regression order\n",
    "\n",
    "  Returns:\n",
    "    ndarray: design matrix for polynomial regression of shape (samples, order+1)\n",
    "  \"\"\"\n",
    "\n",
    "  # Broadcast to shape (n x 1)\n",
    "  if x.ndim == 1:\n",
    "    x = x[:, None]\n",
    "\n",
    "  #if x has more than one feature, we don't want multiple columns of ones so we assign\n",
    "  # x^0 here\n",
    "  design_matrix = np.ones((x.shape[0], 1))\n",
    "\n",
    "  # Loop through rest of degrees and stack columns\n",
    "  for degree in range(1, order + 1):\n",
    "      design_matrix = np.hstack((design_matrix, x**degree))\n",
    "\n",
    "  return design_matrix\n",
    "\n",
    "\n",
    "def solve_poly_reg(x, y, max_order):\n",
    "  \"\"\"Fit a polynomial regression model for each order 0 through max_order.\n",
    "\n",
    "  Args:\n",
    "    x (ndarray): input vector of shape (n_samples)\n",
    "    y (ndarray): vector of measurements of shape (n_samples)\n",
    "    max_order (scalar): max order for polynomial fits\n",
    "\n",
    "  Returns:\n",
    "    dict: fitted weights for each polynomial model (dict key is order)\n",
    "  \"\"\"\n",
    "\n",
    "  # Create a dictionary with polynomial order as keys, and np array of theta\n",
    "  # (weights) as the values\n",
    "  theta_hats = {}\n",
    "\n",
    "  # Loop over polynomial orders from 0 through max_order\n",
    "  for order in range(max_order + 1):\n",
    "\n",
    "    X = make_design_matrix(x, order)\n",
    "    this_theta = ordinary_least_squares(X, y)\n",
    "\n",
    "    theta_hats[order] = this_theta\n",
    "\n",
    "  return theta_hats\n",
    "\n",
    "def evaluate_poly_reg(x, y, theta_hats, max_order):\n",
    "    \"\"\" Evaluates MSE of polynomial regression models on data\n",
    "\n",
    "    Args:\n",
    "      x (ndarray): input vector of shape (n_samples)\n",
    "      y (ndarray): vector of measurements of shape (n_samples)\n",
    "      theta_hat (dict):  fitted weights for each polynomial model (dict key is order)\n",
    "      max_order (scalar): max order of polynomial fit\n",
    "\n",
    "    Returns\n",
    "      (ndarray): mean squared error for each order, shape (max_order)\n",
    "    \"\"\"\n",
    "\n",
    "    mse = np.zeros((max_order + 1))\n",
    "    for order in range(0, max_order + 1):\n",
    "      X_design = make_design_matrix(x, order)\n",
    "      y_hat = np.dot(X_design, theta_hats[order])\n",
    "      residuals = y - y_hat\n",
    "      mse[order] = np.mean(residuals ** 2)\n",
    "\n",
    "    return mse"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "---\n",
    "# Section 1: Cross-validation\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "cellView": "form",
    "colab": {},
    "colab_type": "code"
   },
   "outputs": [],
   "source": [
    "#@title Video 1: Cross-Validation\n",
    "from IPython.display import YouTubeVideo\n",
    "video = YouTubeVideo(id=\"OtKw0rSRxo4\", width=854, height=480, fs=1)\n",
    "print(\"Video available at https://youtube.com/watch?v=\" + video.id)\n",
    "video"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "We now have multiple choices for which model to use for a given problem: we could use linear regression, order 2 polynomial regression, order 3 polynomial regression, etc. As we saw in Tutorial 5, different models will have different quality of predictions, both on the training data and on the test data.\n",
    "\n",
    "A commonly used method for model selection is to asks how well the model predicts new data that it hasn't seen yet. But we don't want to use test data to do this, otherwise that would mean using it during the training process! One approach is to use another kind of held-out data which we call **validation data**: we do not fit the model with this data but we use it to select our best model.\n",
    "\n",
    "We often have a limited amount of data though (especially in neuroscience), so we do not want to further reduce our potential training data by reassigning some as validation. Luckily, we can use **k-fold cross-validation**! In k-fold cross validation, we divide up the training data into k subsets (that are called *folds*, see diagram below), train our model on the first k-1 folds, and then compute error on the last held-out fold. We can then repeat this process k times, once on each k-1 folds of the data. Each of these k instances (which are called *splits*, see diagram below) excludes a different fold from fitting. We then average the error of each of the k trained models on its held-out subset - this is the final measure of performance which we can use to do model selection. \n",
    "\n",
    "To make this explicit, let's say we have 1000 samples of training data and choose 4-fold cross-validation. Samples 0 - 250 would be subset 1, samples 250 - 500 subset 2, samples 500 - 750 subset 3, and samples 750-1000 subset 4. First, we train an order 3 polynomial regression on subsets 1, 2, 3 and evaluate on subset 4. Next, we train an order 3 polynomial model on subsets 1, 2, 4 and evalute on subset 3. We continue until we have 4 instances of a trained order 3 polynomial regression model, each with a different subset as held-out data, and average the held-out error from each instance.\n",
    "\n",
    "We can now compare the error of different models to pick a model that generalizes well to held-out data. We can choose the measure of prediction quality to report error on the held-out subsets to suit our purposes. We will use MSE here but we could also use log likelihood of the data and so on.\n",
    "\n",
    "As a final step, it is common to retrain this model on all of the training data (without subset divisions) to get our final model that we will evaluate on test data. This approach allows us to evaluate the quality of predictions on new data without sacrificing any of our precious training data. \n",
    "\n",
    "Note that the held-out subsets are called either validation or test subsets. There is not a consensus and may depend on the exact use of k-fold cross validation. Sometimes people use k-fold cross validation to choose between different models/parameters to then apply to held-out test data and sometimes people report the averaged error on the held-out subsets as the model performance.   If you are doing the former (using k-fold cross validation for model selection), you must report performance on held-out test data! In this text/code, we will refer to them as validation subsets to differentiate from our completely held-out test data.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "These steps are summarized in this diagram from Scikit-learn (https://scikit-learn.org/stable/modules/cross_validation.html)\n",
    "\n",
    "![Diagram from Sklearn](https://scikit-learn.org/stable/_images/grid_search_cross_validation.png) "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "Importantly, we need to be very careful when dividing the data into subsets.  The held-out subset should not be used in any way to fit the model. We should not do any preprocessing (e.g. normalization) before we divide into subsets or the held-out subset could influence the training subsets. A lot of false-positives in cross-validation come from wrongly dividing. \n",
    "\n",
    "An important consideration in the choice of model selection method are the relevant biases. If we just fit using MSE on training data, we will generally find that fits get better as we add more parameters because the model will overfit the data, as we saw in Tutorial 5. When using cross-validation, the bias is the other way around. Models with more parameters are more affected by variance so cross-validation will generally prefer models with fewer parameters.\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "\n",
    "\n",
    "We will again simulate some train and test data and fit polynomial regression models\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "cellView": "form",
    "colab": {},
    "colab_type": "code"
   },
   "outputs": [],
   "source": [
    "#@title\n",
    "#@markdown Execute this cell to simulate data and fit polynomial regression models\n",
    "\n",
    "### Generate training data\n",
    "np.random.seed(0)\n",
    "n_train_samples = 50\n",
    "x_train = np.random.uniform(-2, 2.5, n_train_samples)  # sample from a uniform distribution over [-2, 2.5)\n",
    "noise = np.random.randn(n_train_samples)  # sample from a standard normal distribution\n",
    "y_train = x_train**2 - x_train - 2 + noise\n",
    "\n",
    "### Generate testing data\n",
    "n_test_samples = 20\n",
    "x_test = np.random.uniform(-3, 3, n_test_samples)  # sample from a uniform distribution over [-2, 2.5)\n",
    "noise = np.random.randn(n_test_samples)  # sample from a standard normal distribution\n",
    "y_test = x_test**2 - x_test - 2 + noise\n",
    "\n",
    "### Fit polynomial regression models\n",
    "max_order = 5\n",
    "theta_hats = solve_poly_reg(x_train, y_train, max_order)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "## Exercise 1: Implement cross-validation\n",
    "\n",
    "Given our set of models to evaluate (polynomial regression models with orders 0 through 5), we will use cross-validation to determine which model has the best predictions on new data according to MSE. \n",
    "\n",
    "In this code, we split the data into 10 subsets using `Kfold` (from `sklearn.model_selection`). `KFold` handles cross-validation subset splitting and train/val assignments.  In particular, the `Kfold.split` method returns an iterator which we can loop through. On each loop, this iterator assigns a different subset as validation and returns new training and validation indices with which to split the data. \n",
    "\n",
    "We will loop through the 10 train/validation splits and fit several different polynomial regression models (with different orders) for each split. You will need to use the `solve_poly_reg` method from Tutorial 4 and `evaluate_poly_reg` from Tutorial 5 (already implemented in this notebook).\n",
    "\n",
    "We will visualize the validation MSE over 10 splits of the data for each polynomial order using box plots."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code"
   },
   "outputs": [],
   "source": [
    "def cross_validate(x_train, y_train, max_order, n_splits):\n",
    "  \"\"\" Compute MSE for k-fold validation for each order polynomial\n",
    "\n",
    "  Args:\n",
    "    x_train (ndarray): training data input vector of shape (n_samples)\n",
    "    y_train (ndarray): training vector of measurements of shape (n_samples)\n",
    "    max_order (scalar): max order of polynomial fit\n",
    "    n_split (scalar): number of folds for k-fold validation\n",
    "\n",
    "  Return:\n",
    "    ndarray: MSE over splits for each model order, shape (n_splits, max_order + 1)\n",
    "\n",
    "  \"\"\"\n",
    "\n",
    "  # Initialize the split method\n",
    "  kfold_iterator = KFold(n_splits)\n",
    "\n",
    "  # Initialize np array mse values for all models for each split\n",
    "  mse_all = np.zeros((n_splits, max_order + 1))\n",
    "\n",
    "  for i_split, (train_indices, val_indices) in enumerate(kfold_iterator.split(x_train)):\n",
    "\n",
    "      # Split up the overall training data into cross-validation training and validation sets\n",
    "      x_cv_train = x_train[train_indices]\n",
    "      y_cv_train = y_train[train_indices]\n",
    "      x_cv_val = x_train[val_indices]\n",
    "      y_cv_val = y_train[val_indices]\n",
    "\n",
    "      #############################################################################\n",
    "      ## TODO for students: Fill in missing ... in code below to choose which data\n",
    "      ## to fit to and compute MSE for\n",
    "      # Fill out function and remove\n",
    "      raise NotImplementedError(\"Student exercise: implement cross-validation\")\n",
    "      #############################################################################\n",
    "\n",
    "      # Fit models\n",
    "      theta_hats = ...\n",
    "\n",
    "      # Compute MSE\n",
    "      mse_this_split = ...\n",
    "\n",
    "      mse_all[i_split] = mse_this_split\n",
    "\n",
    "  return mse_all\n",
    "\n",
    "\n",
    "max_order = 5\n",
    "n_splits = 10\n",
    "plt.figure()\n",
    "\n",
    "# Uncomment below to test function\n",
    "# mse_all = cross_validate(x_train, y_train, max_order, n_splits)\n",
    "# plt.boxplot(mse_all, labels=np.arange(0, max_order + 1))\n",
    "\n",
    "plt.xlabel('Polynomial Order')\n",
    "plt.ylabel('Validation MSE')\n",
    "plt.title(f'Validation MSE over {n_splits} splits of the data');"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "cellView": "both",
    "colab": {},
    "colab_type": "text"
   },
   "source": [
    "[*Click for solution*](https://github.com/NeuromatchAcademy/course-content/tree/master//tutorials/W1D3_ModelFitting/solutions/W1D3_Tutorial6_Solution_5313359d.py)\n",
    "\n",
    "*Example output:*\n",
    "\n",
    "<img alt='Solution hint' align='left' width=558 height=414 src=https://raw.githubusercontent.com/NeuromatchAcademy/course-content/master/tutorials/W1D3_ModelFitting/static/W1D3_Tutorial6_Solution_5313359d_0.png>\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "Which polynomial order do you think is a better model of the data?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "---\n",
    "# Summary\n",
    "\n",
    "We need to use model selection methods to determine the best model to use for a given problem. \n",
    "\n",
    "Cross-validation focuses on how well the model predicts new data."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "---\n",
    "# Appendix"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "## Akaike's Information Criterion (AIC)\n",
    "\n",
    "In order to choose the best model for a given problem, we can ask how likely the data is under a given model. We want to choose a model that assigns high probability to the data. A commonly used method for model selection that uses this approach is **Akaike’s Information Criterion (AIC)**.\n",
    "\n",
    "Essentially, AIC estimates how much information would be lost if the model predictions were used instead of the true data (the relative information value of the model). We compute the AIC for each model and choose the model with the lowest AIC. Note that AIC only tells us relative qualities, not absolute - we do not know from AIC how good our model is independent of others.\n",
    "\n",
    "AIC strives for a good tradeoff between overfitting and underfitting by taking into account the complexity of the model and the information lost. AIC is calculated as:\n",
    "\n",
    "$$ AIC = 2K - 2 log(L)$$\n",
    "\n",
    "where K is the number of parameters in your model and L is the likelihood that the model could have produced the output data. \n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "Now we know what AIC is, we want to use it to pick between our polynomial regression models. We haven't been thinking in terms of likelihoods though - so how will we calculate L? \n",
    "\n",
    "As we saw in Tutorial 2, there is a link between mean squared error and the likelihood estimates for linear regression models that we can take advantage of. \n",
    "\n",
    "*Derivation time!*\n",
    "\n",
    "We start with our formula for AIC from above:\n",
    "\n",
    "$$ AIC = 2k - 2 log L $$\n",
    "\n",
    "For a model with normal errors, we can use the log likelihood of the normal distribution:\n",
    "\n",
    "$$ \\log L = -\\frac{n}{2} \\log(2 \\pi) -\\frac{n}{2}log(\\sigma^2) - \\sum_i^N \\frac{1}{2 \\sigma^2} (y_i - \\tilde y_i)^2$$\n",
    "\n",
    "We can drop the first as it is a constant and we're only assessing relative information with AIC. The last term is actually also a constant: we don't know $\\sigma^2$ in advance so we use the empirical estimate from the residual ($\\hat{\\sigma}^2 = 1/N\\sum_i^N (y_i - \\tilde y_i)^2$). Once we plug this in, the two $\\sum [(y - \\tilde y)^2]$ terms (in the numerator and denominator, respectively) cancel out and we are left with the last term as $\\frac N 2$. \n",
    "\n",
    "Once we drop the constant terms and incorporate into the AIC formula we get:\n",
    "\n",
    "$$AIC = 2k + nlog(\\sigma^2)$$\n",
    "\n",
    "We can replace $\\sigma^2$ with the computation for variance (the sum of squared errors divided by number of samples). Thus, we end up with the following formula for AIC for linear and polynomial regression:\n",
    "\n",
    "$$ AIC = 2K + n log(\\frac{SSE}{n})$$\n",
    "\n",
    "where k is the number of parameters, n is the number of samples, and SSE is the summed squared error.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "### Bonus Exercise: Compute AIC"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code"
   },
   "outputs": [],
   "source": [
    "AIC_list = []\n",
    "order_list = list(range(max_order + 1))\n",
    "\n",
    "for order in order_list:\n",
    "\n",
    "  # Compute predictions for this model\n",
    "  X_design = make_design_matrix(x_train, order)\n",
    "  y_hat = np.dot(X_design, theta_hats[order])\n",
    "\n",
    "  ############################################################################\n",
    "  ## TODO for students: Compute AIC for this order polynomial regression model\n",
    "  # 1) Compute sum of squared errors (SSE) given prediction y_hat and y_train\n",
    "  # 2) Identify number of parameters in this model (K in formula above)\n",
    "  # 3) Compute AIC (call this_AIC) according to formula above\n",
    "  ############################################################################\n",
    "\n",
    "  # Compute SSE\n",
    "  residuals = ...\n",
    "  sse = ...\n",
    "\n",
    "  # Get K\n",
    "  K = len(theta_hats[order])\n",
    "\n",
    "  # Compute AIC\n",
    "  AIC = ...\n",
    "\n",
    "  AIC_list.append(AIC)\n",
    "\n",
    "# Uncomment to test your code\n",
    "# plt.bar(order_list, AIC_list)\n",
    "\n",
    "plt.ylabel('AIC')\n",
    "plt.xlabel('polynomial order')\n",
    "plt.title('comparing polynomial fits')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab": {},
    "colab_type": "text"
   },
   "source": [
    "[*Click for solution*](https://github.com/NeuromatchAcademy/course-content/tree/master//tutorials/W1D3_ModelFitting/solutions/W1D3_Tutorial6_Solution_5682fff2.py)\n",
    "\n",
    "*Example output:*\n",
    "\n",
    "<img alt='Solution hint' align='left' width=560 height=416 src=https://raw.githubusercontent.com/NeuromatchAcademy/course-content/master/tutorials/W1D3_ModelFitting/static/W1D3_Tutorial6_Solution_5682fff2_0.png>\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text"
   },
   "source": [
    "Which model would we choose based on AIC? "
   ]
  }
 ],
 "metadata": {
  "colab": {
   "collapsed_sections": [],
   "include_colab_link": true,
   "name": "W1D3_Tutorial6",
   "provenance": [],
   "toc_visible": true
  },
  "kernel": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.8"
  },
  "toc-autonumbering": true
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
